How to visualise data in R

Published

Last updated: 05/09/2025 15:34

A compendium of code for visualising data in R (primarily using ggplot2).

Note: this guide uses built-in datasets from the following packages, amongst others: ggplot2, gcookbook, palmerpenguins, and gapminder.

1 Structure of a ggplot

  • Aesthetics describe mapping between visual elements and variables in the data, e.g. x-axis may be mapped to “time_point”, while colour may be mapped to “gender”.
  • Geoms are the type of visual ‘marks’ on a plot such as lines, points, or bars (covered under Plot types below): they are geometrical objects used to represent data.

data %>%
  ggplot(aes(x = ind_var, 
             y = dep_var)) +
  geom_point(aes(colour = factor(grouping_var)),
             size = 1.5)

If you want to set a fixed attribute (e.g. one size or colour applied to all points or lines), specify it outside of aes(), because it is not a mapping: it doesn’t relate to a variable in the dataframe itself.
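As a minimal sketch of the difference (using the mpg dataset that ships with ggplot2; the object names p_mapped and p_fixed are only illustrative): the first plot maps colour to a variable inside aes(), so ggplot2 builds a scale and a legend, while the second sets one constant colour outside aes().

```r
library(ggplot2)

# colour is a mapping: it varies with the 'class' column, so it goes inside aes()
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class), size = 1.5)

# colour is a fixed attribute: the same value for every point, so it goes outside aes()
p_fixed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "steelblue", size = 1.5)

p_mapped
p_fixed
```

A common slip is to write aes(colour = "steelblue"): that treats the string as a one-level variable, so you get a legend and a default colour rather than steel blue.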

2 Plot types

2.1 Points

mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point(size = 1.5,
             position = "identity")

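With position = "identity", overlapping points are drawn on top of one another, which hides part of the picture. Changing position to "jitter" adds a small amount of random noise to each point: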
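A sketch of the jittered version, assuming the same mpg plot as above. It uses position_jitter() rather than the "jitter" shortcut so that the amount of noise and the random seed can be set explicitly; the width, height, and seed values here are arbitrary choices.

```r
library(ggplot2)

# position_jitter() gives explicit control over the noise; 'seed' makes it reproducible
p_jitter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 1.5,
             position = position_jitter(width = 0.05, height = 0.5, seed = 1))

p_jitter
```

Keep the jitter small relative to the spacing of the data, otherwise points can appear to have values they don't have.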

Adding mappings for colour and shape:

mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class,
                 shape = class),
             size = 1.5,
             position = "jitter") 

Note that this can also be written as:

mpg %>%
  ggplot() +
  geom_point(aes(x = displ, y = hwy, 
                 colour = class,
                 shape = class), 
             size = 1.5)

2.1.1 Point shapes

Note: the fill attribute can be changed to any colour for shapes 21 to 25, which have a separate outline (colour) and interior (fill).
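For instance, a quick sketch with one of the fillable shapes (21, a circle), again using mpg. The colours and sizes are arbitrary.

```r
library(ggplot2)

# shape 21 is a circle with a separate outline (colour) and interior (fill)
p_shape <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(shape = 21, colour = "black", fill = "orange",
             size = 2.5, stroke = 0.5)

p_shape
```

For the other shapes, fill is ignored and colour controls the whole symbol.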

2.2 Lines

gcookbook::countries %>%
  filter(Name %in% c("United Kingdom", "Ireland") & Year > 1980) %>%
  ggplot(aes(x = Year, y = GDP)) +
  geom_line(aes(colour = Name, 
                linetype = Name), 
            linewidth = 0.8)

# to add arrows to the ends of lines, add this to geom_line:
# arrow = arrow(length = unit(0.25, "cm"), ends = "last", type = "closed")

Adding a line of best fit

gcookbook::heightweight %>%
  ggplot(aes(x = heightIn, y = weightLb)) +
  geom_point(aes(colour = sex)) +
  geom_smooth(method = "lm",
              fullrange = T,
              aes(colour=sex))

Use geom_segment to add lines from points to fitted regression slope using some dummy data:

 x   y     Fitted
12  10  11.015252
 3   2   5.411141
 5   4   6.656499
 3   9   5.411141
 1   5   4.165782
15  13  12.883289

ggplot(df, aes(x = x, y = y)) +
  geom_point() + 
  theme_classic() + 
  geom_smooth(method = "lm", se = F) + 
  geom_segment(aes(x = x, y = y,
                   xend = x, yend = Fitted), 
               linetype = "dashed")
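The recipe above assumes a data frame df that already has a Fitted column. One way to create such a column (a sketch only: it refits a model on just the six rows shown, so the fitted values will differ from the ones printed in the table above) is to fit the regression with lm() and store its fitted values:

```r
# the six x/y pairs shown above, refitted on their own
df <- data.frame(x = c(12, 3, 5, 3, 1, 15),
                 y = c(10, 2, 4, 9, 5, 13))

# fit a simple linear regression and keep the fitted value for each row
fit <- lm(y ~ x, data = df)
df$Fitted <- unname(fitted(fit))

df
```

Because geom_smooth(method = "lm") fits the same model to the same data, the dashed segments end exactly on the smoothed line.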

If you want to sum a value across factor levels in a line graph, use stat_summary (the alternative is to calculate the summary values before piping into ggplot):

df %>%
  ggplot(aes(x=year, y=value)) +
  stat_summary(fun = "sum", geom = "line")
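The alternative mentioned above (summarising first) might look like the sketch below. The data frame is hypothetical: it simply has a year column, a grouping column, and a value column, like the df assumed in the recipe.

```r
library(dplyr)
library(ggplot2)

# hypothetical data: two groups observed in each of three years
df <- data.frame(year  = rep(2020:2022, each = 2),
                 group = rep(c("a", "b"), times = 3),
                 value = c(1, 2, 3, 4, 5, 6))

# sum across groups within each year, then plot the totals directly
totals <- df %>%
  group_by(year) %>%
  summarise(value = sum(value))

p_totals <- ggplot(totals, aes(x = year, y = value)) +
  geom_line()

p_totals
```

Summarising beforehand is a little more code, but it leaves you with a table of the plotted values that you can check or reuse.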

2.2.1 Linetypes

Example usage: linetype = "dashed"
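As a quick reference, this sketch draws the six visible named linetypes (a seventh, "blank", draws nothing). The data frame lt_df is made up purely for the illustration.

```r
library(ggplot2)

# the six visible named linetypes
linetypes <- c("solid", "dashed", "dotted", "dotdash", "longdash", "twodash")
lt_df <- data.frame(linetype = linetypes,
                    y = factor(linetypes, levels = rev(linetypes)))

# scale_linetype_identity() uses the values in the column as-is
p_lty <- ggplot(lt_df) +
  geom_segment(aes(x = 0, xend = 1, y = y, yend = y, linetype = linetype)) +
  scale_linetype_identity() +
  labs(x = NULL, y = NULL)

p_lty
```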

2.3 Bars

2.3.1 Simple bar chart

  • geom_col leaves the data as it is and merely represents values already in the dataframe.
  • geom_bar uses stat_count by default to derive new values from the data. As a result, geom_bar doesn’t expect a y-value, but if you provide one then you are telling it to forgo the aggregation it would have done anyway with stat_count.
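A side-by-side sketch of the two approaches, using mpg so that it runs with ggplot2 alone (class_counts is an illustrative name). Both plots should show the same bar heights.

```r
library(ggplot2)

# geom_bar(): counts the rows in each class for you (stat_count)
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# geom_col(): plots values that are already in the data, here pre-computed counts
class_counts <- as.data.frame(table(class = mpg$class), responseName = "n")
p_col <- ggplot(class_counts, aes(x = class, y = n)) +
  geom_col()

p_bar
p_col
```

geom_col() is equivalent to geom_bar(stat = "identity"), which is the form used in several recipes below.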

Using stat = "count"

palmerpenguins::penguins %>%
  ggplot(aes(x = species, fill = species)) +
  geom_bar(stat = "count", # default - can be omitted
           width = 0.8,
           show.legend = F)

Using stat = "identity" to represent pre-calculated summary stats.

gcookbook::drunk %>%
  pivot_longer(c(2:6)) %>%
  group_by(sex) %>%
  summarise(felonies = sum(value)) %>%
  ggplot(aes(x = sex, y = felonies)) +
  geom_bar(stat = "identity", width = 0.8)

You should also use stat_identity if you want to reorder bars in descending order. The syntax for reorder is reorder(what, by_what). Here, -n means descending order.

palmerpenguins::penguins %>%
  group_by(species) %>%
  summarise(n = n()) %>%
  ggplot(aes(reorder(species, -n), n)) +
  geom_bar(stat = "identity", width = 0.8, 
           aes(fill = species),
           show.legend = F)

2.3.2 Dodged bar chart

Dodged bars represent geoms for each level of some factor (here, sex), arranged side by side for ease of comparison. Note that this dataset has NAs. To remove these before graphing, use drop_na.

palmerpenguins::penguins %>%
  # drop_na(sex) %>%
  ggplot(aes(x = island, fill = sex)) +
  geom_bar(position = position_dodge(),
           width = 0.8)

2.3.3 Stacked bar chart

Stacked bars are placed on top of one another. The recipe below counts the rows in each category; if you already have a y-value (e.g. pre-calculated counts or figures for each category), map it to y and add stat = "identity".

palmerpenguins::penguins %>%
  ggplot(aes(x = island, fill = sex)) +
  geom_bar(position = position_stack(), 
           width = 0.8)

2.4 Proportional stacked bar

2.4.1 When there is an explicit y-value

Use this recipe if your dataframe already has a variable representing the y-value (e.g. count). Notice use of stat_identity.

prop_df

group  type  count
a      x     11091
a      y      4583
b      x      3974
b      y     10984

prop_df %>%
  ggplot(aes(x = group, y = count, fill = type)) +
  geom_bar(position = position_fill(), stat = "identity", width = 0.8) +
  ylab("Proportion")

2.4.2 When there is no explicit y-value (i.e. counts of a factor)

palmerpenguins::penguins %>%
  ggplot(aes(x = island, fill = sex)) +
  geom_bar(position = position_fill(),
           width = 0.8) +
  ylab("Proportion")

Text is dealt with later on, but to add value labels to the chart above you need to calculate the proportions for each cell first and then pipe these into ggplot. The recipe below also has some nifty code for converting decimals into nicely formatted percentages.

palmerpenguins::penguins %>%
  group_by(island, sex) %>%
  summarise(n = n()) %>% # you may need to use sum() here instead of n()
  mutate(prop = n / sum(n)) %>%
  ggplot(aes(x = island, y = prop, fill = sex)) +
  geom_bar(stat = "identity", width = 0.8) +
  geom_text(aes(label = paste0(sprintf("%.1f", prop * 100), "%")), 
            position = position_stack(vjust = 0.5),
            size = 3) +
  ylab("Proportion")

# note: to change decimal places, "%.0f", "%.2f" etc

2.5 Boxplot

palmerpenguins::penguins %>%
  ggplot(aes(x = species, y = bill_length_mm)) +
  # this part produces just the I-shaped lines
  stat_boxplot(
    geom ='errorbar', 
    width = 0.5) +
  # this part produces the boxes
  geom_boxplot(
    notch = F,
    outlier.color = "red",
    outlier.size = 2.5
  )

2.6 Violin plot

A more sophisticated version of a boxplot using density curves to show the distribution for the entire range of values.

palmerpenguins::penguins %>%
  ggplot(aes(x = species, y = bill_length_mm)) +
  geom_violin(aes(fill = species),
              draw_quantiles = c(0.5)) # specify 0.5 for median

2.7 Dotplot

For when there isn’t a huge amount of data. Each dot represents a single observation.

palmerpenguins::penguins %>%
  sample_n(125, replace=F) %>%
  ggplot(aes(x=body_mass_g)) +
  geom_dotplot(dotsize = 1, width = 1)

2.8 Histogram

For when there’s more data, or binning is better. This only requires an x-axis value. geom_histogram defaults to 30 bins, which is often a reasonable starting point, but the binwidth can be set manually.

palmerpenguins::penguins %>%
  ggplot(aes(x = body_mass_g)) +
  geom_histogram(binwidth = 100, 
                 fill = "grey", 
                 colour = "black")

Recipe: overlay separate histograms for levels of a factor (here, cut). By default, geom_histogram will stack bars if there are multiple groups but you can change this by changing position to identity:

ggplot2::diamonds %>%
  ggplot(aes(x = carat, fill = cut)) +
  geom_histogram(binwidth = 0.1, position = "identity", alpha = 0.5)

Sometimes you may want to produce histograms for several variables, each split by the same grouping factor. Here is a recipe for a function that iterates through the variables using purrr:

# firstly make a vector of all the variable names you want to plot
vars = c("bill_length_mm", "flipper_length_mm", "body_mass_g")

# then create a custom graphing function
hist_fun = function(data, x, y) {
  ggplot(data, aes(x = .data[[x]], fill = .data[[y]]) ) +
    geom_histogram(alpha = 0.5, position = "identity") +
    theme_bw() +
    ggtitle(x)
}

# use purrr::map to cycle through vars to produce plots, with a constant grouping factor
purrr::map(vars, ~ hist_fun(data = palmerpenguins::penguins, .x, "sex") )

2.9 Density

For large amounts of data. Use bw argument to change bandwidth (e.g. if you want more smoothing).

ggplot2::diamonds %>%
  ggplot(aes(x = carat)) +
  geom_density(bw = 0.05, fill = "lightblue", 
               alpha = 0.8)

As with the multiple-group histogram example above, you may want to overlay density distributions for levels of a factor.

ggplot2::diamonds %>%
  ggplot(aes(x = carat, fill = cut)) +
  geom_density(bw = 0.05, alpha = 0.4)

2.10 Pie

Everyone knows that pie charts are illegal in the world of data visualisation, but here’s a recipe just in case…

palmerpenguins::penguins %>%
  count(species) %>%
  ggplot(aes(x = "", y = n, fill = species)) +
  geom_col(color = "black") +
  coord_polar(theta = "y") +
  geom_text(aes(label = n),
            position = position_stack(vjust = 0.5)) +
  theme_void()

2.11 Sankey

This recipe produces an interactive html Sankey chart using the networkD3 package. It may take a little set up beforehand according to the data you have.

links = data.frame(
  Var1 = c("Failed_Phonics", "Passed_Phonics", "Failed_Phonics", "Passed_Phonics"),
  Var2 = c("Failed_KS2", "Failed_KS2", "Passed_KS2", "Passed_KS2"),
  Freq = c(25114, 120356, 9019, 585893)
)

colnames(links) <- c("source", "target", "value")

nodes <- data.frame(name = as.factor(c("Failed_Phonics","Passed_Phonics","Failed_KS2","Passed_KS2")))

# Convert the source and target from characters to indices
links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1

# create Sankey diagram
networkD3::sankeyNetwork(Links = links, 
              Nodes = nodes, 
              Source = "source",
              Target = "target", 
              Value = "value", 
              NodeID = "name", 
              fontSize = 12, 
              fontFamily = "sans-serif")

# the same steps wrapped in a function (assumes var1 and var2 are both binary, giving a 2x2 table)
sankey_function = function(var1, var2, lab1, lab2, lab3, lab4){
  
  tab = table(var1, var2)
  
  links = data.frame(
    Var1 = c(lab1, lab2, lab1, lab2),
    Var2 = c(lab3, lab3, lab4, lab4),
    Freq = c(tab[1,1], tab[2,1], tab[1,2], tab[2,2])
  )
  
  colnames(links) <- c("source", "target", "value")
  
  nodes <- data.frame(name = as.factor(c(lab1, lab2, lab3, lab4)))
  
  links$source <- match(links$source, nodes$name) - 1
  links$target <- match(links$target, nodes$name) - 1
  
  # create Sankey diagram
  networkD3::sankeyNetwork(Links = links, 
                           Nodes = nodes, 
                           Source = "source",
                           Target = "target", 
                           Value = "value", 
                           NodeID = "name", 
                           fontSize = 12, 
                           fontFamily = "sans-serif")
}


sankey_function(var1 = ela_final$ks4_mfl, var2 = ela_final$ks5_mfl_any,
                lab1 = "ks4_mfl_no", lab2 = "ks4_mfl_yes",
                lab3 = "ks5_mfl_no", lab4 = "ks5_mfl_yes")
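A note on the match(...) - 1 step used in both versions above: networkD3 hands the links to JavaScript, which expects node indices to start at zero, whereas R counts from one. A tiny base-R sketch of that conversion (node_names and source_labels are illustrative names):

```r
# node names, in the order they should be numbered
node_names <- c("Failed_Phonics", "Passed_Phonics", "Failed_KS2", "Passed_KS2")

# match() returns R's 1-based positions; subtracting 1 gives 0-based indices
source_labels <- c("Failed_Phonics", "Passed_Phonics")
one_based  <- match(source_labels, node_names)
zero_based <- one_based - 1

one_based
zero_based
```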

3 Some more advanced recipes

Add a histogram to a scatterplot using ggExtra::ggMarginal. This requires you to set up the basic plot first and save as an object.

# create data
set.seed(30)
df1 = data.frame(x = rnorm(500, 50, 10), y = runif(500, 0, 50))
# set up base plot
p1 = ggplot(df1, aes(x, y)) + geom_point() + theme_bw()

ggExtra::ggMarginal(p1, type = "histogram",
                    margins = "both", # note: "x", "y", or "both"
                    fill = "orange",
                    xparams = list(binwidth = 2))

Create a scatterplot matrix using GGally::ggpairs. This is quite customisable depending on what you need; see help(ggpairs) for the options.

# note: you can also add colouring by factor if you add: ggplot2::aes(colour = Species)
GGally::ggpairs(iris)

4 Plotting statistical summary data

Often we want to plot statistical summaries rather than raw data. ggplot has functions to do this “in-plot”, i.e. instead of you having to compute summaries manually beforehand. One option is to use arguments within geom_bar.

palmerpenguins::penguins %>%
  ggplot(aes(x = species, y = flipper_length_mm)) +
  geom_bar(fun = "median", 
           stat = "summary")

However, a more powerful method is to use stat_summary. You can plot several summaries together, e.g. if you want points and lines on the same plot.

palmerpenguins::penguins %>%
  ggplot(aes(x = species, y = flipper_length_mm)) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun = "mean", geom = "line", aes(group = 1)) 

Plot the mean and standard deviation using mean_sdl, which requires the Hmisc package (use the mult argument to change how many standard deviations are shown around the mean):

mpg %>%
  ggplot(aes(x = reorder(class, hwy), y = hwy)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_sdl,
               fun.args = list(mult = 1),
               geom = "errorbar",
               width = .4)

Add confidence intervals (requires the Hmisc package):

mpg %>%
  ggplot(aes(x = reorder(class, hwy), y = hwy)) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun.data = "mean_cl_normal",
               fun.args = list(conf.int = .95),
               geom = "errorbar",
               width = .4) 

4.1 Pointrange

This shows the mean and SE.

gapminder::gapminder %>%
  filter(year %in% c(1950:1990)) %>%
  ggplot() +
  geom_pointrange(mapping = aes(x = year, y = lifeExp),
                  stat = "summary")

5 Working with text

This diagram shows the main text elements in a ggplot2 graph, all of which are controllable (see below).

5.1 Plot title, subtitle, and caption

p + ggtitle("Main plot title", subtitle = "Plot subtitle")

# use labs to add a caption
p + labs(caption = "my caption")

# caption is right-justified by default. To change:
p + theme(plot.caption = element_text(hjust=0))

# centre align plot title
p + theme(plot.title = element_text(hjust = 0.5))

# change plot title size
p + theme(plot.title = element_text(size = 18))

5.2 Adding text to charts

If you just want to add static text somewhere on a chart, use this:

# for plot with numeric axes:
plot + geom_text(x = 100, y = 100, label = "my label", hjust = 1.2)
# for plot with date on x-axis
plot + geom_text(x = as.Date("2025-09-01"), y = 100, label = "my label", hjust = 1.2)
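One thing to be aware of: geom_text() with a fixed label inherits the plot's data, so the label is drawn once per row and can look bold or fuzzy. A possible alternative (a sketch on mpg, with arbitrary coordinates) is annotate(), which draws the label once:

```r
library(ggplot2)

p_base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# annotate() draws the label once, independent of the number of rows in the data
p_note <- p_base +
  annotate("text", x = 6, y = 40, label = "my label", hjust = 1)

p_note
```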

Recipe to add the stat to the top of a bar.

palmerpenguins::penguins %>%
  group_by(species) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = species, y = n)) +
  geom_bar(stat = "identity", width = 0.8) +
  geom_text(
    aes(label = n, vjust = -0.5),
    size = 3.5)

It’s also possible to display text for certain conditions.

# p + geom_text(aes(label = ifelse(condition == "con", paste0(round(diff_pc,1), "%"), ""), vjust = -0.5),size = 3.5)

# a more common strategy is to calculate bar counts beforehand and then plug these in later.
bar_data = ela_final %>%
  filter(!is.na(status)) %>%
  # reorder the factor so that the order of filled bars is descending (first 
  # on top, last at bottom)
  mutate(status = factor(status, levels = c("Deceased", "Cancelled", "Rejected",
                                            "Waiting list", "Matching", 
                                            "Applicant withdrawn", "Allocated"))) %>%
  count(AY, status)

ggplot(bar_data, aes(x = AY, y = n, fill = status)) +
  geom_bar(stat = "identity") +
  geom_text(
    # just show counts from "Allocated" status
    data = bar_data %>% filter(status == "Allocated"),
    aes(x = AY, y = n, label = n),
    position = position_stack(vjust = 0.5),
    color = "black",
    size = 3.5
  )

6 Axes

Note that the same functions are used for x or y axes - adjust accordingly.

# custom axis titles
p + xlab("...")
p + ylab("...")

# remove axis title
p + theme(axis.title.x = element_blank())

# change axis title text size
p + theme(axis.title.x = element_text(size = 14))

# rotate x-axis title text 45 degrees
p + theme(axis.title.x = element_text(angle = 45, vjust = 0.5))

# rotate y-axis to 0 degrees (to be read horizontally not vertically)
p + theme(axis.title.y = element_text(angle = 0, vjust = 0.5))

# remove tick marks
p + theme(axis.ticks.x = element_blank())

6.1 Axis formatting options

# percent labels with breaks every 10% from 0 to 100%
# note: set scale = 1 if your values are already percentages (0-100), otherwise leave at the default of 100
p + scale_y_continuous(labels = scales::label_percent(accuracy = 1, scale = 100), breaks = (0:10)/10)

# currency prefix
p + scale_y_continuous(labels = scales::label_dollar(prefix = "£"))
# can also specify suffix, big.mark = ",", and decimal.mark = "."

# thousands separator
p + scale_y_continuous(labels = scales::label_comma(big.mark = ","))

# dates and times (the axis variable must be a date or time object, so use the matching scale)
p + scale_y_date(labels = scales::label_date(format = "%Y-%m-%d"))
p + scale_y_time(labels = scales::label_time(format = "%H:%M:%S"))

Using scale_x_date: this requires the x-axis to be formatted as a date object.

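# firstly, a nifty function if your date value is in the format "201516"
# (it returns 1 January of the second year, e.g. "201516" becomes 2016-01-01)
convert_year = function(year){
  yr = substr(year, 5, 6)
  full_yr = paste0("20", yr, "-01-01")
  as.Date(full_yr, format = "%Y-%m-%d") # last expression, so the date is returned visibly
}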

# Note that you can either supply date_labels OR a character vector of custom labels, not both. 
p + scale_x_date(date_labels = "%Y", date_breaks = "1 year")

# if your x-axis is in date format, you can try the following:
yrs = c(as.Date("2016-06-01"), as.Date("2017-06-01"), as.Date("2018-06-01"))
yrs_labs = c("15/16", "16/17", "17/18")
p + scale_x_date(breaks = yrs,
                 labels = yrs_labs)

# I've found that I get an error saying `breaks` and `labels` have different lengths, so I just supply a dummy value first and this seems to fix it (the dummy value itself isn't displayed).
p + scale_x_date(labels = c("x", "12/13", "13/14", "14/15", "15/16", "16/17", "17/18", "18/19"))

# to show specific dates only
p + scale_x_date(breaks = as.Date(c("2022-07-01", "2022-07-03", "2022-07-05")))

# sometimes it's difficult to see why axis breaks don't match up with labels. In this case it can be 
# useful to save the graph as an object and then inspect its breaks:
layer_scales(graph_object)$x$break_positions()

6.2 Limits and ranges of axes

It’s often desirable to alter the range of axes on a graph, e.g. to start an axis at zero. Different options result in different effects - specifically, limits will actually cut off data points, whereas other options like coord_cartesian will leave the underlying data alone but just display the part of the axis you’ve specified.

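# coord_cartesian(xlim = c(0, 100), ylim = c(0, 100)) # zooms in (doesn't clip data)
# scale_x_continuous(limits = c(0, 100)) # removes data outside the limits

coord_cartesian changes only the visible region of the plot; the underlying data (and anything computed from it, such as the smoothed lines below) are left alone.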

iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width)) + geom_point() +
  geom_smooth(aes(colour = Species)) +
  coord_cartesian(xlim = c(4.5, 5.5))
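To see the difference between the two approaches described above, here is a sketch on mpg (the limits of 2 to 4 are arbitrary). layer_data() shows what each layer actually receives.

```r
library(ggplot2)

p_base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# scale limits: values outside 2-4 are treated as missing (and dropped with a warning)
p_limits <- p_base + scale_x_continuous(limits = c(2, 4))

# coord_cartesian: all data are kept; only the visible window changes
p_zoom <- p_base + coord_cartesian(xlim = c(2, 4))

# number of usable x values reaching the point layer in each case
sum(is.finite(suppressWarnings(layer_data(p_limits))$x))
sum(is.finite(layer_data(p_zoom)$x))
```

This matters most for geom_smooth and other stats, which are computed only from the data that survive the scale limits.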

coord_fixed specifies a ratio for the display of the two axes. A 1-to-1 ratio would be a value of 1. A value <1 compresses the y-axis.

# Also try e.g. coord_fixed(20/1), i.e. a 20-to-1 ratio where one unit on the y-axis is drawn
# 20 times as long as one unit on the x-axis
  
iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width)) + geom_point() +
  geom_smooth(aes(colour = Species)) +
  coord_fixed(0.5)

A tip: if you want a y-axis to begin at zero but don’t know the maximum, you can do this:

p + expand_limits(y = 0)

When you have a discrete x-axis, sometimes you get a lot of space between aesthetics and the edges of the plot. To control the spacing, add scale_*_discrete(expand = c(0.1, 0.1)):

# regular plot with too much padding:
padding = mpg %>%
    mutate(year = as.factor(year)) %>%
    group_by(year) %>%
    summarise(cty_m = mean(cty, na.rm = T)) %>%
    pivot_longer(c(cty_m)) %>%
    ggplot(aes(x = year, y = value)) +
    stat_summary(fun = "mean", geom = "line", aes(group = 1))
padding

# getting rid of padding:
padding +  scale_x_discrete(expand = c(0.1, 0.1))

Clipping determines whether to display elements that would lie outside the plot panel. Expanding sets a buffer margin around a plot to prevent overlapping.

ggplot(mtcars, aes(wt, mpg)) +
  geom_point(size = 2) +
  # zero expansion
  coord_cartesian(expand = FALSE,
                  clip = "off") +
  ggtitle("Clipping off")

ggplot(mtcars, aes(wt, mpg)) +
  geom_point(size = 2) +
  # zero expansion
  coord_cartesian(expand = FALSE,
                  clip = "on") +
  ggtitle("Clipping on")

Use custom labels for x-axis: say you have numeric year labels in the data such as 202122 but you want to display these as “2021/22”.

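# define a vector of labels, one for each break
yr_labels = c("2018/19", "2019/20", "2020/21", "2021/22", "2022/23")

# supply matching breaks so that `breaks` and `labels` have the same length
plot + scale_x_continuous(breaks = c(201819, 201920, 202021, 202122, 202223),
                          labels = yr_labels)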

7 Themes

To add a pre-defined theme to a plot: p + theme_*().

Tips:

  • Add a complete theme (theme_*()) early in the plot sequence, before any theme() adjustments to e.g. axis attributes, because a complete theme added later will override them.
  • Set a global theme: theme_set(theme_classic(base_size = 16)).
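A small sketch of the first tip, using mpg (the object names are illustrative). The only difference between the two plots is the order in which the complete theme and the theme() tweak are added.

```r
library(ggplot2)

p_base <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point()

# tweak applied last: the legend stays at the bottom
p_kept <- p_base + theme_minimal() + theme(legend.position = "bottom")

# complete theme applied last: it resets legend.position to its own default
p_lost <- p_base + theme(legend.position = "bottom") + theme_minimal()

p_kept
p_lost
```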

7.1 Analysis Function theme (afcharts)

The Analysis Function (AF) has created its own package, afcharts, for producing accessible plots based on ggplot2. The main function is theme_af(), which has the following arguments:

theme_af(
  base_size = 14,
  base_line_size = base_size/24,
  base_rect_size = base_size/24,
  grid = c("y", "x", "xy", "none"),
  axis = c("x", "y", "xy", "none"),
  ticks = c("xy", "x", "y", "none"),
  legend = c("right", "left", "top", "bottom", "none")
)

See this guidance: “There’s general guidelines for research report figures here https://www.gov.uk/government/publications/research-reports-guide-and-template but there’s also a package {afcharts} that formats ggplots in line with government guidelines (https://best-practice-and-impact.github.io/afcharts/). It’s also recommended to save plots as .svg files as they’re editable outside of R”.

afcharts::use_afcharts()
palmerpenguins::penguins %>%
  ggplot(aes(x = island, fill = sex)) +
  geom_bar(position = position_dodge(),
           width = 0.8) +
  afcharts::theme_af(legend = "bottom")

gapminder::gapminder %>%
  filter(country %in% c("United Kingdom", "China", "Togo", "Bangladesh")) %>%
  ggplot(aes(x = year, y = lifeExp, colour = country)) +
  geom_line(linewidth = 1) +
  afcharts::theme_af(legend = "bottom") +
  afcharts::scale_colour_discrete_af() +
  scale_y_continuous(limits = c(0, 82),
                     breaks = seq(0, 80, 20),
                     expand = c(0, 0)) +
  scale_x_continuous(breaks = seq(1952, 2007, 5)) +
  labs(
    x = "Year",
    y = NULL)

You will need to reset the default ggplot2 theme if you don’t want to use afcharts any more.

ggplot2::theme_set(theme_grey())

7.2 A custom theme I like

# make this into a custom function that can be applied to any plot
theme_chris = function(){
  theme(
    axis.line = element_line(colour = "grey80"),
    panel.grid.major.y = element_line(colour = "grey90"),
    panel.grid.major.x = element_blank(), 
    panel.background = element_rect(fill = "white", colour = NA))
}

p + theme_chris()

8 Legends

ggplot will create a legend automatically if a variable is mapped to an aesthetic such as linetype, fill, colour, shape, or size.

# by default, legends appear on the right
mpg %>%
  group_by(year, cyl) %>%
  summarise(mean_cty = mean(cty)) %>% 
  ggplot(aes(x = year, y= mean_cty, fill = factor(cyl))) +
  geom_bar(stat = "identity", position = position_dodge()) +
  ggtitle("right (default)")

# To change:

p + theme(legend.position = "none") 
p + theme(legend.position = "top") 
p + theme(legend.position = "bottom")

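# custom location for legend (coordinates run from 0 to 1 within the plot panel)
# note: from ggplot2 3.5.0, use legend.position = "inside" with legend.position.inside = c(0.1, 0.9)
p + theme(legend.position = c(0.1, 0.9))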
# alter legend box aesthetics, e.g. make transparent
p + theme(legend.background = element_blank())

To control legend wrapping (e.g. to force a single horizontal row instead of two rows), which is especially useful when legend.position = "bottom":

p + guides(colour = guide_legend(nrow = 1)) 

You can control the palette, breaks, labels, and name. E.g. if the factor labels are too long, you can shorten them. Use a separate call to control other attributes like linetype or fill. In fact, the scale_*_manual functions are incredibly versatile for this kind of task (and are discussed in more detail below).

gcookbook::countries %>%
    filter(str_detect(Name, "United")) %>%
    ggplot(aes(x = Year, y = GDP)) + 
    geom_line(aes(colour = Name), linewidth = 0.8) +
    scale_colour_manual(values = c(RColorBrewer::brewer.pal(3, "Set2")),
                        breaks = c("United Kingdom", "United Arab Emirates", "United States"),
                        labels = c("UK", "UAE", "US"),
                        name = "Country") +
  theme_minimal()

8.1 Direct labels

Sometimes a legend isn’t necessary or the right aesthetic choice. Instead, it’s possible to append labels directly to dots or lines using directlabels or ggrepel.

Using directlabels and geom_dl. Note that other method options include first.points and last.qp (which adjusts the size of the text automatically). This also requires clipping to be turned off and the plot margins to be extended (otherwise the text won’t be displayed properly).

df %>%
  ggplot(aes(x = time_period, y = total_exam_entries)) +
  geom_line(aes(colour = subject), linewidth = 0.8) +
  ylab("Entries") + 
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle=270, vjust = 0.5),
        legend.position = "none",
        axis.line = element_line(colour="black"),
        panel.grid.minor = element_blank()) +
  scale_y_continuous(labels = scales::label_comma(big.mark = ","),
                     limits = c(0,80000)) +
  scale_x_continuous(breaks = c(200910, 201011, 201112, 201213, 201314, 201415, 201516, 201617, 
                                201718, 201819, 201920, 202021, 202122, 202223),
                     labels = c("2009/10", "2010/11", "2011/12", "2012/13", "2013/14", "2014/15",
                                "2015/16", "2016/17", "2017/18", "2018/19", "2019/20", "2020/21",  
                                "2021/22", "2022/23")) +
  directlabels::geom_dl(aes(label=subject), method=list("last.points", "bumpup", cex=0.8)) +
  coord_cartesian(clip="off") +
  theme(plot.margin = unit(c(1,4,1,1), "lines")) 

Using geom_label_repel. Note that this requires a bit more wrangling to make sure only the final points are displayed (using the data argument) - otherwise every single point will be labelled.

df %>%
  ggplot(aes(x = time_period, y = total_exam_entries, label = subject)) +
  geom_line(aes(colour = subject), linewidth = 0.8) +
  ylab("Entries") + 
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle=270, vjust = 0.5),
        legend.position = "none",
        axis.line = element_line(colour="black"),
        panel.grid.minor = element_blank()) +
  scale_y_continuous(labels = scales::label_comma(big.mark = ","),
                     limits = c(0,80000)) +
  scale_x_continuous(breaks = c(200910, 201011, 201112, 201213, 201314, 201415, 201516, 201617, 
                                201718, 201819, 201920, 202021, 202122, 202223),
                     labels = c("2009/10", "2010/11", "2011/12", "2012/13", "2013/14", "2014/15",
                                "2015/16", "2016/17", "2017/18", "2018/19", "2019/20", "2020/21",  
                                "2021/22", "2022/23")) +
  coord_cartesian(clip = "off") +
  ggrepel::geom_label_repel(aes(label = subject),
                            label.padding = .15,
                            data = df %>% group_by(subject) %>% filter(time_period == max(time_period)),
                            size = 3, hjust = 0.5, nudge_x=0.5)

9 Other things

9.1 Flip coordinates

ggplot2::diamonds %>%
  ggplot(aes(x=cut, y=carat)) +
  geom_violin() +
  coord_flip()

10 Using colours

10.1 Manual

Quick manual palettes for sequential and categorical variables (adjust to number needed).

seq2 = c("#12436D", "#6BACE6")
seq3 = c("#12436D", "#2073BC", "#6BACE6")
cat6 = c("#12436D", "#28A197", "#801650", "#F46A25", "#3D3D3D", "#A285D1")
# Sequential palettes
blues = c("#104F75", "#407291", "#7095AC", "#9FB9C8", "#CFDCE3")
reds = c("#8A2529", "#A15154", "#B97C7F", "#D0A8A9", "#E8D3D4")
oranges = c("#E87D1E", "#ED974B", "#F1B178", "#F6CBA5", "#FAE5D2")
yellows = c("#C2A204", "#CEB536", "#DAC768", "#E7DA87", "#F3ECCD")
greens = c("#004712", "#336C41", "#669171", "#99B5A0", "#CFDABD")
purples = c("#260859", "#51397A", "#7D6B9B", "#A89CBD", "#D4CEDE")

scales::show_col(c(blues, reds, oranges, yellows, greens, purples), ncol=5, cex_label=0.7)

10.2 scale_*_manual

# example of sequential. Tip: use rev() to reverse the order of colours
ggplot2::diamonds %>%
  filter(price < 6000) %>%
  ggplot(aes(x = price, fill = cut)) +
  geom_histogram(position = "dodge", binwidth = 1000) +
  scale_fill_manual(values = rev(yellows))

An example of how to specify manual linetype and colour. Note that these must use the same name, breaks, and labels in order to be presented as a single legend (as opposed to separate legends for linetype and for colour):

p +
  scale_linetype_manual(name = "",
                        values = c("solid", "dashed"),
                        breaks = c("pay", "median_after_tax"),
                        labels = c("Pay", "Median UK")) +
  scale_colour_manual(name = "",
                      values = c("black", "red"),
                      breaks = c("pay", "median_after_tax"),
                      labels = c("Pay", "Median UK"))

If you would like NA to be a different colour by default

plot +
  scale_fill_discrete(na.value = "white")

Use a focus palette to bring attention to one particular series, while keeping the other data there for comparison.

focus = c("#12436D", "#BFBFBF", "#BFBFBF") # contrast ratio of 5.57:1 (passes Web Content Accessibility
# Guidelines [WCAG]).
# Add as many grey colours as you need.

gcookbook::countries %>% 
  filter(str_detect(Name, "United")) %>%
  ggplot(aes(x=Year, y=GDP)) + 
  geom_line(aes(colour = Name), linewidth = 0.8) +
  scale_colour_manual(values = focus,
                        breaks = c("United Kingdom", "United Arab Emirates", "United States"),
                        labels = c("UK", "UAE", "US"),
                        name = "Country")

Alternatively, use gghighlight:

gcookbook::countries %>% 
  filter(str_detect(Name, "Republic")) %>%
  ggplot(aes(x=Year, y=GDP)) + 
  geom_line(aes(colour = Name), linewidth = 0.8) +
  gghighlight::gghighlight(str_detect(Name, "Czech"),
                           label_params = list(size = 3),
                           label_key = Code)

10.3 Other colour scale options

# if a fill or colour variable is continuous, the default scale will be scale_*_continuous,
# so there's no need to specify it directly:
p = gcookbook::heightweight %>%
  ggplot(aes(x = heightIn, y = weightLb, colour = heightIn)) +
  geom_point(size = 2)

p + #scale_colour_continuous() +
  theme_void() +
  ggtitle("scale_colour_continuous")

p + scale_colour_gradient(low = "white", high = "red", na.value = "black") +
  theme_void() +
  ggtitle("scale_colour_gradient")

# scale_colour_gradient2 allows you to set a mid-point colour but you have to specify this from
# the dataframe. Of course you don't have to use the median as the 'midpoint' here. 
p + scale_colour_gradient2(low = "blue", mid = "white", high = "red",
  midpoint = median(gcookbook::heightweight$heightIn)) +
  theme_void()

# scale_colour_gradientn allows you to specify your own set of colours for a whole spectrum.
# note: if you don't specify a vector of values, the colours will be evenly positioned along
# the scale. However, you can control this manually for instance if you want to highlight the very 
# top values. Just specify a vector of values between 0 and 1 to correspond to how you want the 
# colours mapped to the scale, e.g. values = c(1, 0.9, 0.8, 0.7, 0)
p + scale_colour_gradientn(colours = c("red","yellow","green","lightblue","darkblue"),
                                    values = c(1, 0.9, 0.8, 0.7, 0)) +
  theme_void() +
  ggtitle("scale_colour_gradientn")
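
If it is easier to think in data units than in 0 to 1 positions, the positions can be computed with scales::rescale. The sketch below uses mtcars rather than the heightweight data, and the cut-offs in colour_stops are arbitrary values chosen for illustration. Setting limits to the same range keeps the positions aligned with the data values.

```r
library(ggplot2)

# Data values (horsepower) at which each colour should sit; illustrative cut-offs only
colour_stops = c(50, 100, 150, 250, 350)

# Convert the cut-offs to the 0-1 positions that the `values` argument expects
stop_positions = scales::rescale(colour_stops)

p_stops = ggplot(mtcars, aes(x = wt, y = mpg, colour = hp)) +
  geom_point(size = 2) +
  scale_colour_gradientn(colours = c("darkblue", "lightblue", "green", "yellow", "red"),
                         values = stop_positions,
                         limits = range(colour_stops))
p_stops
```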

10.4 Brewer palettes

Use scale_*_brewer to apply predefined ColorBrewer palettes. See a list of available palettes here.
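
The palette names can also be listed from within R. This is a small sketch using the brewer.pal.info data frame from the RColorBrewer package; the object names seq_palettes and colourblind_safe are made up for this example.

```r
library(RColorBrewer)

# brewer.pal.info has one row per palette, with columns maxcolors,
# category ("seq", "div" or "qual") and colorblind (TRUE/FALSE)
seq_palettes = rownames(brewer.pal.info[brewer.pal.info$category == "seq", ])
colourblind_safe = rownames(brewer.pal.info[brewer.pal.info$colorblind, ])

seq_palettes
colourblind_safe

# to draw every palette in a single chart:
# display.brewer.all()
```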

10.4.1 Brewer monochrome palettes

  • Blues
  • Greens
  • Purples
  • Greys
  • Oranges
  • Reds
ggplot2::diamonds %>%
  filter(price <6000) %>%
  ggplot(aes(x = price, fill = cut)) +
  geom_histogram(position = "dodge", binwidth = 1000) + 
  scale_fill_brewer(palette = "Greens") +
  theme_void()

10.4.2 Brewer multi-hue sequential palettes

  • BuGn
  • BuPu
  • GnBu
  • OrRd
  • PuBu
  • PuRd
  • RdPu
  • YlGn
  • PuBuGn
  • YlGnBu
  • YlOrBr
  • YlOrRd
ggplot2::diamonds %>%
  filter(price <6000) %>%
  ggplot(aes(x = price, fill = cut)) +
  geom_histogram(position = "dodge", binwidth = 1000) + 
  scale_fill_brewer(palette = "YlOrRd") +
  theme_void()

10.4.3 Brewer diverging palettes

  • Spectral
  • RdYlGn
  • RdYlBu
  • RdGy
  • RdBu
  • PuOr
  • PRGn
  • PiYG
  • BrBG
ggplot2::diamonds %>%
  filter(price <6000) %>%
  ggplot(aes(x = price, fill = cut)) +
  geom_histogram(position = "dodge", binwidth = 1000) + 
  scale_fill_brewer(palette = "Spectral") +
  theme_void()

10.4.4 Brewer qualitative palettes

  • Accent
  • Dark2
  • Paired
  • Pastel1 and Pastel2
  • Set1
  • Set2
  • Set3
ggplot2::diamonds %>%
  filter(price <6000) %>%
  ggplot(aes(x = price, fill = cut)) +
  geom_histogram(position = "dodge", binwidth = 1000) + 
  scale_fill_brewer(palette = "Set2") +
  theme_void()

Example of a categorical palette from the Analysis Function.

# note: recommendation is to use a max of 4 colours
categorical = c("#12436D", "#28A197", "#801650", "#F46A25", "#3D3D3D", "#A285D1")
show_col(categorical)

How to use a custom palette:

ggplot2::midwest %>%
  filter(county %in% c("ADAMS", "ALEXANDER", "BOND", "BOONE", "BROWN")) %>%
  ggplot(aes(x = county, y = poptotal, fill = county)) +
  geom_col(position = position_dodge()) +
  scale_fill_manual(values = categorical)
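
With an unnamed vector, colours are assigned to levels in order. If a level should always get the same colour (for example across several charts), one option is to name the vector. The sketch below assumes the same categorical palette and counties as above; counties, county_colours and p_named are names introduced for illustration.

```r
library(ggplot2)

categorical = c("#12436D", "#28A197", "#801650", "#F46A25", "#3D3D3D", "#A285D1")
counties = c("ADAMS", "ALEXANDER", "BOND", "BOONE", "BROWN")

# name each colour after the level it should represent
county_colours = setNames(categorical[seq_along(counties)], counties)

p_named = ggplot(midwest[midwest$county %in% counties, ],
                 aes(x = county, y = poptotal, fill = county)) +
  geom_col(position = position_dodge()) +
  scale_fill_manual(values = county_colours)
p_named
```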

Using scale_fill_brewer palettes:

ggplot2::midwest %>%
  filter(county %in% c("ADAMS", "ALEXANDER", "BOND", "BOONE", "BROWN")) %>%
  ggplot(aes(x=county, y=poptotal, fill = county)) +
  geom_col(position = position_dodge()) +
  scale_fill_brewer(type = "qual", palette = "Set1")

11 Mapping

Use the sf package and a shapefile to visualise data on a map. This shapefile comes from the UK government’s Geoportal. This tool is also a good resource for finding different shapefiles.

library(sf)
library(tidyverse)

# firstly read in a shapefile. Note that when you download a shapefile, it comes with other 
# files like .shx. These all need to be in the same directory as the .shp file itself.
# As such, you can actually just specify the folder name that contains the shapefile and its
# accompanying files. 

eng_reg_map = st_read("./Shapefiles/Eng_regional_2023/RGN_DEC_2023_EN_BFC.shp")
Reading layer `RGN_DEC_2023_EN_BFC' from data source 
  `/Users/chrisdixon/G Drive/R/R script files/How to do stuff in R/Shapefiles/Eng_regional_2023/RGN_DEC_2023_EN_BFC.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 9 features and 7 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 82668.52 ymin: 5336.966 xmax: 655653.8 ymax: 657536
Projected CRS: OSGB36 / British National Grid
# check that the map is appropriate for your needs by graphing it without any data attached
eng_reg_map %>%
  ggplot() +
  geom_sf(fill   = "white",
          colour = "black") +
  theme_void()

# next, add some data from a preexisting source:
df = data.frame(
  region = c("North West", "North East", "Yorkshire and The Humber", 
             "East of England", "West Midlands", "East Midlands", 
             "South East", "London", "South West"),
  n = c(81, 25, 39, 49, 51, 43, 53, 115, 32)
)

# join this to the sf object and map. 
# use scale_fill_gradient to control fill colours

eng_reg_map %>%
  left_join(df, by = c("RGN23NM" = "region")) %>%
  ggplot(aes(fill = n)) +
  geom_sf(colour = "grey50") +
  theme_void() +
  scale_fill_gradient(name = "Respondents",
                      low = "grey90", high = "grey15", 
                      na.value = "white")
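
A left join only matches region names that are spelled identically, and anything unmatched silently becomes NA (drawn in white above). A quick check before joining can catch this. In the sketch below, map_regions is a stand-in for eng_reg_map$RGN23NM (the names are assumed, so check them against your own shapefile) and the data frame deliberately contains one mismatch.

```r
# Stand-in for eng_reg_map$RGN23NM; assumed values, check against your shapefile
map_regions = c("North East", "North West", "Yorkshire and The Humber",
                "East Midlands", "West Midlands", "East of England",
                "London", "South East", "South West")

df_check = data.frame(
  region = c("North West", "North East", "Yorkshire and the Humber", "London"),
  n = c(81, 25, 39, 115)
)

# regions in the data with no exact match in the map (note the lower-case "the")
unmatched = setdiff(df_check$region, map_regions)
unmatched
```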

To add points based on geometry:

# say you have a data frame from GIAS with easting and northing values of schools (these could be lat/long too)
map_df = data.frame(
school = c(1:15),
easting = c(427144, 390498, 391459, 453596, 332195, 439952, 397143, 331548, 504305, 527647, 349039, 437349, 504740, 448965, 166988),
northing = c(433927, 323814, 162039,  86591, 391270, 114064, 157408, 390872, 208312, 169328, 405788, 377158, 103181, 361343,  44078),
pupils  = c(28, 23, 59, 22, 28, 76, 13, 14, 57, 43, 25, 12, 59, 52, 31)
)

# the shapefile uses a projected CRS (British National Grid), which is measured in eastings and
# northings, so the points only need to be given the same CRS as the map.
# The first step here is to check which CRS the original map uses
st_crs(eng_reg_map)$epsg
[1] 27700
# use the st_as_sf function to create another version of your dataframe (map_df) - feed in the crs
# and it will add an additional column called 'geometry'

schools.sf = st_as_sf(map_df, coords = c("easting", "northing"), remove = FALSE, crs = 27700)

# check that the points themselves work
# plot(schools.sf$geometry)

# add to the map
eng_reg_map %>%
  ggplot() +
  geom_sf(fill   = NA,
          colour = "grey50") +
  geom_sf(data = schools.sf, colour = "red", size = 2) +
  theme_void()

# optionally if you have used aes(colour = pupils): + scale_colour_gradient(low = "grey", high = "red")
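
If the points are stored as longitude and latitude instead, one approach is to declare them as WGS84 (EPSG:4326) and then reproject them to the CRS of the map. This is a sketch with two made-up points; latlong_df, latlong_sf and schools_bng are hypothetical names.

```r
library(sf)

# Hypothetical points in longitude/latitude (WGS84)
latlong_df = data.frame(
  school = c("A", "B"),
  long   = c(-0.1276, -2.2426),
  lat    = c(51.5072, 53.4808)
)

# coords are given in x, y order: longitude first, then latitude
latlong_sf = st_as_sf(latlong_df, coords = c("long", "lat"), crs = 4326)

# reproject to British National Grid so that the points line up with the map
schools_bng = st_transform(latlong_sf, crs = 27700)

st_crs(schools_bng)$epsg
st_coordinates(schools_bng)
```

The resulting object can be passed to geom_sf(data = schools_bng, ...) in the same way as schools.sf above.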

12 Odd bits for sorting later

12.1 Vertical and horizontal lines

mpg %>% 
  ggplot(aes(x=displ, y = cty)) + 
  geom_point() + 
  geom_vline(xintercept = 4.5, colour = "red", linetype="dashed") +
  geom_hline(yintercept = 22.5, colour = "purple", linetype = "twodash")

12.2 Shading

# annotate("rect") draws a single shaded rectangle (it uses geom_rect under the hood)
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() +
     annotate("rect", xmin = 3, xmax = 4.2, ymin = 12, ymax = 21,
        alpha = .2)

# to use with dates:
# annotate("rect", xmin = as.Date("2020-06-01"), xmax = as.Date("2021-06-01"), ymin = -Inf, ymax = Inf,
#           alpha = .2)
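
A runnable version of the date example might look like the sketch below. It uses the built-in ggplot2::economics data, and the shaded period is an arbitrary choice for illustration.

```r
library(ggplot2)

p_shaded = ggplot(economics, aes(x = date, y = unemploy)) +
  # adding the rectangle before the line draws the shading underneath it
  annotate("rect",
           xmin = as.Date("2007-12-01"), xmax = as.Date("2009-06-01"),
           ymin = -Inf, ymax = Inf,
           alpha = 0.2) +
  geom_line()
p_shaded
```

Using -Inf and Inf for ymin and ymax makes the band span the full height of the panel whatever the y-axis limits are.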

13 Combining and saving plots

13.1 Facetting

To facet by a single variable, use facet_wrap. Add scales = "free" if the axes should differ for each group (or use "free_x" or "free_y" to free only one axis). Note: the older formula syntax was ~ species; vars() also lets you specify several groupings, e.g. vars(species, sex).

palmerpenguins::penguins %>% 
  ggplot(aes(x = bill_length_mm, y = flipper_length_mm)) + 
  geom_point(aes(colour = island)) +
  facet_wrap(vars(species), 
             nrow = 3,
             scales = "fixed") +
  theme_bw()
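
As a sketch of the vars() form with more than one grouping, the example below uses the built-in mpg data instead of the penguins data. labeller = "label_both" is optional and adds the variable name to each strip label.

```r
library(ggplot2)

p_facets = ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  # one panel per combination of drive type and model year
  facet_wrap(vars(drv, year),
             ncol = 2,
             scales = "free_y",
             labeller = "label_both") +
  theme_bw()
p_facets
```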

To facet by two variables, use facet_grid:

palmerpenguins::penguins %>% 
  drop_na(sex) %>%
  ggplot(aes(x=bill_length_mm, y=flipper_length_mm)) + 
  geom_point() +
  facet_grid(species ~ sex) +
  theme_bw()

13.2 Arranging

Use ggarrange from the ggpubr package to arrange plots.

# plot 1
a = gcookbook::countries %>%
    filter(Name == "United Kingdom") %>%
    ggplot(aes(x = Year, y = GDP)) + 
    geom_line() +
    scale_y_continuous(labels = label_comma(big.mark = ",")) +
    theme_bw()

# plot 2
b = gcookbook::countries %>%
    filter(Name == "United Kingdom") %>%
    ggplot(aes(x = Year, y = healthexp)) + 
    geom_line() +
    scale_y_continuous(labels = label_comma(big.mark = ",")) +
    theme_bw()

# combine
ggarrange(a, b, 
          ncol=2, 
          widths = c(1,1),
          labels = c("A", "B"))

You can also nest ggarrange calls, e.g. if you have 3 or more plots:

c = gcookbook::countries %>%
    filter(Name == "United Kingdom") %>%
    ggplot(aes(x = Year, y = infmortality)) + 
    geom_line() +
    theme_bw()

ggarrange(a,
          ggarrange(b, c, ncol = 2),
          nrow = 2)
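
A similar layout can be sketched with the patchwork package, which is a different tool from ggarrange and is shown here only as an alternative. The plots p1, p2 and p3 are stand-ins built from the mpg data.

```r
library(ggplot2)
library(patchwork)

p1 = ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + theme_bw()
p2 = ggplot(mpg, aes(x = class)) + geom_bar() + theme_bw()
p3 = ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 20) + theme_bw()

# `/` stacks plots vertically and `|` places them side by side
combined = p1 / (p2 | p3) +
  plot_annotation(tag_levels = "A")
combined
```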

It’s also possible to use a shared legend and title (if applicable to all plots):

# plot 1
a = gcookbook::countries %>%
    filter(str_detect(Name, "United")) %>%
    ggplot(aes(x = Year, y = GDP)) + 
    geom_line(aes(colour = Name, linetype = Name), linewidth = 0.8) +
    theme_bw()

# plot 2
b = gcookbook::countries %>%
    filter(str_detect(Name, "United")) %>%
    ggplot(aes(x = Year, y = healthexp)) + 
    geom_line(aes(colour = Name, linetype = Name), linewidth = 0.8) +
    theme_bw()

# wrap in annotate_figure for common title
annotate_figure(ggarrange(a, b, ncol = 2,
                          common.legend = TRUE,
                          legend = "bottom"),
    top = text_grob("A common title", color = "red", face = "bold", size = 14),
    bottom = text_grob("Data source: \n Countries data set", color = "blue",
                                  hjust = 1, x = 0.99, face = "italic", size = 10),
    fig.lab.pos = "top",
    fig.lab.size = 14)

13.3 Saving

Files will be saved in your current working directory.

my_plot = ggplot(...)

ppi = 300  # pixels per inch
png("file_name.png", width = 4*ppi, height = 4*ppi, res = ppi)
print(my_plot)  # an explicit print() is needed inside functions, loops or source()
dev.off()
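
ggplot2 also provides ggsave, which opens and closes the graphics device for you. The sketch below writes to tempdir() only so that it does not create files in your project; the plot itself is a placeholder.

```r
library(ggplot2)

my_plot = ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# tempdir() is used here for illustration; a plain file name saves to the working directory
out_file = file.path(tempdir(), "file_name.png")

# width and height are in inches, dpi sets the resolution
ggsave(out_file, plot = my_plot, width = 4, height = 4, units = "in", dpi = 300)
```

The file type is taken from the extension, so changing the name to end in .pdf or .svg switches the output format.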

13.4 Looping or iterating

Sometimes you want to produce the same graph for each of several variables. The recipe below produces histograms, grouped by ‘EAL’, for every variable listed in the covariates vector. It uses purrr::map to iterate through this vector and then cowplot::plot_grid to combine the plots in a single file.

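# note: this function reads a data frame called `data` from the calling environment,
# and both x and y must be supplied as column names (strings)
hist_group_fun = function(x, y) {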
  ggplot(data, aes(x = .data[[x]], fill = .data[[y]], colour = .data[[y]]) ) +
    geom_histogram(alpha=0.5, position = "identity") +
    theme(legend.text = element_text(size=9))
}

hist_group_fun(x = "IDACI.rank19", y = "EAL")

# make a list of variables to loop through
covariates = c("cpm.raw", "bpvs.raw", "IDACI.rank19")

# iterate through covariates
covariate_plots = map(covariates, ~ hist_group_fun(.x, "EAL") )

# save to a single pdf file
pdf("histograms_all.pdf")
print(cowplot::plot_grid(plotlist = covariate_plots))
dev.off()
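
If one plot per page is preferred to a grid, the plots can be printed one at a time while the pdf device is open. This sketch is self-contained: it passes the data frame in as an argument and uses mtcars in place of the original data, so hist_by_group, car_data and variables are hypothetical names.

```r
library(ggplot2)
library(purrr)

# same idea as hist_group_fun, but the data frame is an explicit argument
hist_by_group = function(data, x, group) {
  ggplot(data, aes(x = .data[[x]], fill = .data[[group]], colour = .data[[group]])) +
    geom_histogram(alpha = 0.5, position = "identity", bins = 10) +
    labs(title = x)
}

car_data = transform(mtcars, am = factor(am, labels = c("automatic", "manual")))
variables = c("mpg", "wt", "hp")

plots = map(variables, ~ hist_by_group(car_data, .x, "am"))

# each print() call starts a new page in the pdf
out_pdf = file.path(tempdir(), "histograms_pages.pdf")
pdf(out_pdf)
walk(plots, print)
dev.off()
```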

Another way to achieve this is by converting variable names to symbols. The function below creates dodged bar plots (with the same y-axis variable each time), and purrr::map then iterates through a vector of grouping variables (vars) so that each plot uses the next grouping factor in the vector. The function uses tidy evaluation (the unquote or ‘bang-bang’ operator, !!).

bar_fun = function(data, var1, var2) {
  
  # convert grouping factors into symbols first
  var1 = sym(var1)
  var2 = sym(var2)
  
  data %>%
    filter(psc_pass == "Fail") %>%
    # the below effectively becomes group_by(c_summer_born, SEX)
    group_by(!!var1, !!var2) %>%
    summarise(PHONICSMARK = round(mean(PHONICSMARK, na.rm = TRUE), 1), .groups = "drop") %>%
    ggplot(aes(x = !!var1, y = PHONICSMARK, fill = !!var2)) +
    geom_bar(stat = "identity", position = position_dodge()) +
    geom_text(aes(label = PHONICSMARK), 
              vjust = 1.2, 
              position = position_dodge(width = 0.9)) +
    labs(title = paste("By", rlang::as_string(var2))) +
    theme_classic() +
    theme(axis.title = element_blank())
}

vars = c("SEX", "c_disad", "c_any_Sen", "c_white_british", "c_eal")
  
plots = purrr::map(vars, ~ bar_fun(data = ph_cen2, var1 = "c_summer_born", var2 = .x))

patchwork::wrap_plots(plots, ncol = 3)

This slightly different recipe takes an outcome variable and a grouping factor and produces a faceted histogram. It uses enquo to capture the grouping variable, counts that variable's distinct values, and supplies the count to the nrow argument of facet_wrap.

dist_graph = function(data, x, char){
  
  # quote the 'char' variable
  char_var = enquo(char)
  
  # find unique number of values in 'char' by unquoting
  width = data %>% 
    summarise(dist = n_distinct(!!char_var)) %>% as.numeric()
  
  data %>%
    ggplot(aes(x = {{x}}, fill = {{char}})) +
    geom_histogram(binwidth = 1, position  = "identity") +
    facet_wrap(vars({{char}}), 
               scales = "free",
               nrow = width) +
    theme(legend.position = "none",
          axis.title.y = element_blank()) +
    geom_vline(xintercept = 32, colour = "red")
}

dist_graph(c1_c2, x = PHONICSMARK_y1, char = MoB)
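
The same recipe can be sketched using only the embrace operator {{ }}, without a separate enquo step. The version below drops the reference line at 32 and runs on the built-in mpg data; dist_graph2 and n_levels are names invented for this example.

```r
library(ggplot2)
library(dplyr)

dist_graph2 = function(data, x, char) {

  # count the distinct values of the grouping variable
  n_levels = data %>%
    distinct({{ char }}) %>%
    nrow()

  data %>%
    ggplot(aes(x = {{ x }}, fill = {{ char }})) +
    geom_histogram(binwidth = 1, position = "identity") +
    facet_wrap(vars({{ char }}),
               scales = "free",
               nrow = n_levels) +
    theme(legend.position = "none",
          axis.title.y = element_blank())
}

p_dist = dist_graph2(mpg, x = hwy, char = drv)
p_dist
```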